Stratification of diabetes in the context of comorbidities, using representation learning and topological data analysis

Diabetes is a heterogenous, multimorbid disorder with a large variation in manifestations, trajectories, and outcomes. The aim of this study is to validate a novel machine learning method for the phenotyping of diabetes in the context of comorbidities. Data from 9967 multimorbid patients with a new diagnosis of diabetes were extracted from Clinical Practice Research Datalink. First, using BEHRT (a transformer-based deep learning architecture), the embeddings corresponding to diabetes were learned. Next, topological data analysis (TDA) was carried out to test how different areas in high-dimensional manifold correspond to different risk profiles. The following endpoints were considered when profiling risk trajectories: major adverse cardiovascular events (MACE), coronary artery disease (CAD), stroke (CVA), heart failure (HF), renal failure (RF), diabetic neuropathy, peripheral arterial disease, reduced visual acuity and all-cause mortality. Kaplan Meier curves were plotted for each derived phenotype. Finally, we tested the performance of an established risk prediction model (QRISK) by adding TDA-derived features. We identified four subgroups of patients with diabetes and divergent comorbidity patterns differing in their risk of future cardiovascular, renal, and other microvascular outcomes. Phenotype 1 (young with chronic inflammatory conditions) and phenotype 2 (young with CAD) included relatively younger patients with diabetes compared to phenotypes 3 (older with hypertension and renal disease) and 4 (older with previous CVA), and those subgroups had a higher frequency of pre-existing cardio-renal diseases. Within ten years of follow-up, 2592 patients (26%) experienced MACE, 2515 patients (25%) died, and 2020 patients (20%) suffered RF. QRISK3 model’s AUC was augmented from 67.26% (CI 67.25–67.28%) to 67.67% (CI 67.66–67.69%) by adding specific TDA-derived phenotype and the distances to both extremities of the TDA graph improving its performance in the prediction of CV outcomes. We confirmed the importance of accounting for multimorbidity when risk stratifying heterogenous cohort of patients with new diagnosis of diabetes. Our unsupervised machine learning method improved the prediction of clinical outcomes.


Data and study population
This study has been performed on a CPRD Gold cut containing 4 million patients registered between 1985 and 2015 with linkage to HES (Hospital Episode Statistics) for secondary care records and ONS for death register records.
To fully cover patients' health journeys, only patients with hospital and ONS data linkage were included in the study. Each practice in CPRD has a date after which data was up-to-standard; only events recorded after this date were included in the study.
In this study, we are interested in patients with new onsets of diabetes mellitus (DM), where DM is defined as per its CALIBER code. To have broad coverage of patients' medical history and similarly to BEHRT (4), only subjects with at least 5 visits in their records are considered in this study. Moreover, to provide our models with enough past events to learn from, we included only those subjects whose diagnosis of DM was registered at least as nth comorbidity (i.e. a DM occurrence is only considered if at least n-1 other diseases were reported before it). In this paper, we used n=7, which is large enough to enable the model to learn from enough past events and still produce a significant number of patients. This results in 9,967 patients with diabetes mellitus (DM).
In traditional research using EHR data (including the ones using ML), experts define several features/markers at a given baseline to represent the patients. Recent developments in DL, however, provided us with models that can learn optimal representations (e.g., of individuals or diseases) from raw or minimally processed data, with minimal need for expert guidance (and hence the name representation learning). The complexities of EHR data, which can result in a vast universe of possible representations, led to the growing popularity and superior performance of representation learning in this field.

BEHRT
To extract contextualised representations from EHR, we used the BEHRT architecture, which has been shown to achieve state-of-the-art performance in predicting the onset of new diseases (4). In addition to diagnoses, medications, and age, BEHRT uses encodings for positions (counter of the visit) and segments (to distinguish between consecutive visits). At any holiday, the sum of all these embeddings will result in an embedding (shown at the bottom of Figure 1S (I)), which will be the learned representation of one's EHR at that point. Figure 1S shows BEHRT's Transformer-based architecture. It is first pre-trained by the masked language model (MLM) task (similar to the original BERT architecture (5)), to learn the network parameters (including the disease and drug embeddings) that can predict the masked disease tokens. When used for downstream tasks such as disease predictions, the model finetunes the weights pre-trained in the MLM task and learns the weights for the classification layer.
In this study, we used the embeddings obtained using the MLM task (shown across layers from {Ei} to {Ti} vectors, in Figure 1S (II)) to have a contextualised representation of one's EHR at a given point in time which is not biased towards a particular clinical outcome. Note that, the closer we get to the Ti layer (i.e., further away from the Ei layer), the more contextualised the embeddings are likely to be (6). Therefore, we expect (i.e., our hypothesis) to observe different embeddings for the same token (i.e. a given disease or medication) due to differences in their surrounding sequence (i.e., context).
Note that to avoid any potential leakage, we define for each patient, the baseline t as the date of occurrence of DM (as n th comorbidity) and will only feed to BEHRT the encounters which occurred on or before t. To feed every included patients' EHR to BEHRT -given the inconsistencies in their length -we will pad the records that are shorter than the required length (fixed at 256 tokens in this study).
This architecture has been shown to capture a context-aware representation of words/tokens in the Natural language processing field (6). The aim of this study was to show that these embeddings provide an accurate representation of the current state of every patient' health condition and that they can be used to mine different patterns of a given disease. Figure 1S: Encoding EHR data using BEHRT (credits to (4)).

Topological data analysis
In this study, we used a TDA technique called Mapper (7), which operates as follows: a) it projects the data from the embedding space to a lower dimensional space, b) it covers the data in the projection space using overlapping hypercubes, c) it performs clustering of the projected data within each hypercube. Each obtained cluster will correspond to a node in the graph, if a data point belongs to more than one cluster, these clusters will be linked using an edge. Figure 2S illustrates this process. We used a Python implementation of Mapper called Kepler Mapper (8,9).
In this study we used t-SNE as our mapping space as it has been shown to perverse many existing relationships in a lower dimensional space (10). This has been done using the following parameters set empirically: we used K-Means as a clustering algorithm with 3 clusters and 20 hypercubes with a 50% overlap proportion. These parameters have been set to obtain a fully connected graph network structure; changing these parameters will generally not substantially alter the structure of the graph; however significantly reducing the overlap proportion will results in disjoint systems. Similarly, significantly reducing the number of hypercubes will result in a very small chart which may not be useful for untangling complex relationships. It is, therefore generally necessary to fine-tune the parameters of the TDA algorithm until the obtention of a detailed and succinct description of the data (11). Figure 2S: Illustration of TDA Mapper algorithm

Graph partitioning method
The obtained TDA graph is partitioned using the multilevel k-way partitioning algorithm, which cuts the graph size by collapsing its nodes and edges (coarsening step), finds a k-way partition of the condensed graph. Then it constructs a k-way partition for the original graph by projecting and refining the partition to successively finer graphs (uncoarsening step) (12,13). In this study, we used the implementation provided by the METIS graph partitioning software (14).
This algorithm segments the graph into n contiguous regions. To determine the optimal value of n, we increment it until we reach the value for which the difference between the top diseases between each pair of obtained clusters is not significant anymore (P-values of the corresponding Chi-squared test greater than 0.05).

Handling missing data
Since the data is not necessarily missing at random and may be related to the unobserved values themselves, imputing missing data may not be the optimal practice, as it may introduce bias and affect the validity of the summary statistics. For this reason, only patients with complete data are considered when producing counts for variables with missing data (such as BMI). It is worth noting that transformers, like sequence-to-sequence models, do not have fixed input or output format or length when developing the model architecture. They take a sequence of varying lengths and output a sequence of varying lengths, making missing data irrelevant. All information becomes a vector in a sequence of vectors. Thus, the method of including only cases without missing data does not challenge the model architecture.